Statistical Thesaurus Construction for a Morphologically Rich Language
Abstract
Corpus-based thesaurus construction for Morphologically Rich Languages (MRL) is a complex task because of the morphological variability of such languages. In this paper we explore alternative term representations, complemented by clustering of morphological variants. We introduce a generic algorithmic scheme for thesaurus construction in MRL, and demonstrate the empirical benefit of our methodology for a Hebrew thesaurus.
Similar resources
ThesWB: A Tool for Thesaurus Construction from HTML Documents
Electronically available documents on the Web are exploding at an ever-increasing rate. Many Web documents, however, contain rich knowledge that describes the document's content. The Web can be viewed as a body of text containing two fundamentally different types of data: the contents and the tags. A tag is in HTML (Hyper-Text Markup Language) meta-data describing the layout and linking structu...
Improving Translation to Morphologically Rich Languages (Améliorer la traduction des langages morphologiquement riches) [in French]
While statistical techniques for machine translation have made significant progress in the last 20 years, results for translating to morphologically rich languages are still mixed versus previous generation rule-based systems. Current research in statistical techniques for translating to morphologically rich languages varies greatly ...
Exploration and Study of Chinese Thesaurus Automation Construction for Digital Libraries
The paper aims to explore Chinese thesaurus automation construction based on the freely available digital library resources. The key methods and study results are presented in the paper. The study adopted the technology of natural language processing to analyze the linguistic characteristics of terms, and combined it with statistical analysis to extract the terms from technical literature. Our ...
Rich Morphology Generation Using Statistical Machine Translation
We present an approach for generation of morphologically rich languages using statistical machine translation. Given a sequence of lemmas and any subset of morphological features, we produce the inflected word forms. Testing on Arabic, a morphologically rich language, our models can reach 92.1% accuracy starting only with lemmas, and 98.9% accuracy if all the gold features are provided.
Semi-Automatic Practical Ontology Construction by Using a Thesaurus, Computational Dictionaries, and Large Corpora
This paper presents the semi-automatic construction method of a practical ontology by using various resources. In order to acquire a reasonably practical ontology in a limited time and with less manpower, we extend the Kadokawa thesaurus by inserting additional semantic relations into its hierarchy, which are classified as case relations and other semantic relations. The former can be obtained ...